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Introduction 

Wigner matrices have a ubiquitous presence in science; from 
the computation of molecular quantum states, through the de- 
scription of solitons in particle physics and convolution of beam 
and sky algorithms in astronomy, they are needed to sometimes 
very high quantum numbers making fast and accurate algorithms 
that calculate them important. Other methods have been devel- 
oped that calculate these matrices exactly but with sub-optimal 
performance to very high angular momenta fl], or approxi- 
mately but very efficiently Q, but none that calculates them 
exactly and quickly to almost arbitrarily high angular momen- 
tum. Two such methods are presented in this paper and applied 
to a convolution algorithm between beam and sky. The follow- 
ing section gives some basic properties of Wigner matrices, and 
this is followed by a section describing the algorithm. The fourth 
section describes its application to convolution and a summary 
is presented at the end. 



1. Wigner matrices 

Wigner matrix elements^] typically denoted by D' mm ,(a,fl, y), are 
the eigenfunctions of the Schrodinger equation for a symmetric 
top and form an irreducible basis of the Lie group SU(2), and 
the rotation group SO(3); the angles a, fl and y are the Euler 
angles that define the orientation of the top. As basis functions of 
SU(2), the D l ,(a,fl,y) satisfy the standard angular momentum 



relations 



X 



J 2 D< mm ,(a,fl,y) = /(/ + l)D l mm ,(a,fl,y) 
hD' mm ,(a,p,y) = mD l mm ,(a,fl,y) 
l f D l lm ,,(a,fl,y) = m' D' mmr (a,fl,y) , 



(1) 

(2) 
(3) 



where / labels the irreducible representation of SU(2) and also 
corresponds to the quantum number representing the total an- 
gular momentum of the eigenfunction; -I < m, m' < I are the 
quantum numbers representing the projections of the total angu- 
lar momentum on two z-axes rotated with respect to each other 
as described below. 

The Euler angles are defined as three rotations: a rotation y 
about the z-axis that rotates the x and y axes — > x' and y'; this 
first rotation is followed by a rotation fl about the new y'-axis 
rotating x' and z axes — > x" and z'; the final rotation a is about 
z'. In the basis we are using as defined by Eq. |2]) and Eq. ([5]), the 
operators J- and J z > are diagonal and D l mm ,(a,j3,y) has the form 



D l mmr {a,p,y) = e- ma d l mm We 



^> where 



d mnAP) = (/W|exp 
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\ltri) . 



(4) 



(5) 



d l , (fl) is called the reduced Wigner matrix element and consists 

mm y " ' ° 

of the overlap of a spherical harmonic with another spherical 
harmonic that has been rotated by an angle fl about the y-axis. 
The differential equation satisfied by d' mm ,(fl) is 



d 2 d' ,(fl) 
dfl 2 



d/3 



(6) 



/ 2mm' cos fl 



sin 2 /? 



+ 1(1+1) Km<03) = O. 



From the Schrodinger equation in Eq. it is possible to extract 
3-term recursion relations that relate reduced Wigner matrix el- 
ements that differ in their quantum numbers. In principle, it is 
possible to use such relations to calculate the d' mm ,(fl). 3-term re- 
cursion relations can be unstable, which limits their usefulness 
unless the potential pitfalls are identified and avoided. Two ex- 
amples of these relations are 

-m+^cosfl ^ = 1 V(Z + m , )(Z _„, + 1 ^_ i( 

+ 1 yfc - m ')(l + m' + l)d' mm>+1 (fl) , (7) 

and 



cos/? - 



1(1 + 1) 



V (P- m 2 )( ; 2 _ OT , 2) 

mm(P> 1(21+1) " ,mW 



V[(7+ l) 2 -m 2 ][(l+ l) 2 
(1+ 1)(2/+ 1) 



u mm' V-v 



(8) 



Generally, 3-term recursion relations will have two linearly 
independent solutions, f„ and g„ [|4]; these solutions can be os- 
cillatory or exponentially decreasing or increasing. In the non- 
oscillatory case, /„ is the minimal solution if 



fn 



as n — > oo 



(9) 



while g n is the dominant solution. For solutions to the 
Schrodinger equation, exponentially increasing/decreasing solu- 
tions appear only in the region where a particle can not clas- 
sically exist because of energy conservation, but where a wave 
function can be non-zero in quantum mechanics. In the case of 
a rigid rotor [5|, the kinetic energy of a spherically symmetric 
rotor is: 



2IT = p l p + 



1 



sin 2 /3 



(pl + Pi ~ 2p a py cos/3) 



(10) 



In classical mechanics, p^ 
quantization of Eq. 
P/s — > -id/dfl and p y 



> 0. 



In quantum mechanics, the 
TO) means substituting p a — > -id/da, 
-id/dy. These substitutions combined 
with an eigenfunction of the form (|4]) and the additional substi- 
tution 2IT — > /(7 + l)D l mm ,(a,fl,y) inferred from Eq. |TJl yields 
Eq. (|6jl . Since p^ corresponds to the first two terms of Eq. (|6|, 
one concludes that classically we would have 



/(/ + 1) + 



2mm' cos fl - m 
sin 2 fl 



> . 



(11) 



For a nice review of Wigner matrices, see (3). 



Wherever Eq. (Hi is not satisfied, the solutions will be expo- 
nentially suppressed or divergent. When solving the Schrodinger 
equation for the physical solutions, the divergent solutions are 
simply put to zero. When using the 3-term recursion relations, 
the divergent solution can be 'sniffed' out because of round-off 
errors and the recursions quickly fail. One special case where 
this cannot happen is when m, m' = where Eq. ( fTT) is always 
satisfied since / > 0. In that case, Eq. ([H} is stable and can be 
used to calculate <i' )0 (jS) to very high I extremely accurately. 

For the cases where m, m' + 0, we can still use 3-term re- 
cursion relations provided we do so in the right direction in the 
quantum number being varied. For example, looking at Eq. (|7), 
one can either calculate each d' mm ,(fl) for increasing m' or de- 
creasing m'. In one direction, the divergent solution will be 
growing while in the other it will be shrinking. To determine 
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the direction in which Eq. |7]i is stable, one need only consider 



Eq. (Hi. Assume you are interested in evaluating all the reduced 
Wigner matrix elements d l Qm ,iJ3) for < m! < I using Eq. <|vj>; you 
can choose to begin your recurrence with d' Q0 (/3) and increasing 
m! or begin from d' 0l (J3) and decreasing m! . To use Eq. (J7J) in a sta- 
ble manner, you need start from dLifi) and decrease m! . Putting 
m — in Eq. (Hi yields the new condition 

m' 2 

(12) 



/(/+!)- 



sin 2 /? 



>0. 



where it is seen that as m! increases from (taking for example 
P = n/A), we approach the non-classical and violate Eq. ( 12 1 at 
m! > sin/3 -\Jl(l + 1); increasing m' further means sampling the 
divergent dominant solution of the Schrodinger equation from 
which Eq. dTli is derived. It is then clear that the stable direction 
to use Eq. is for decreasing \m'\. From Eq. ( [TT] ), it is seen 
quite generally that the recursion relations |7]i and ([8]l will be 
stable provided they are used in the direction of decreasing \m'\ 
and increasing / respectively. 

In addition to Eqs |7) and |[8), a third recursion relation in /3 
can be derived by discretizing the derivatives in Eq. (|6|i with the 
relations 



/'(*) = 
fix) 2 



fix + e) - f(x - e) 
2e 

fix + e) + fix~e)~2fix) 



+ Oie 2 f") 



+ (Kef") 



(14) 



Substituting into Eq. (|6]l yields 



e 1 



1 2mm! cos /3 - m 2 - m'' 



sin 2 /? 



(=«-.) 



<Lt<fi- 



+ Oie 3 d'Z 



■€)- 

m 



+ K1 + 1) 
(-*♦.) 



dLW) = 



d l mm ,iJ3 + e)+ (15) 



From Eq. (Hi, it is seen that this recursion relation should be 
used for increasing /3 if < /3 < n/2 and decreasing f3 if 
n/2 < /3 < n. Examples of these conclusions are given in Fig. [T] 
The top plot shows the change in the behavior of d l ,(J3) with 
increasing I as one moves from the non-classical to classical re- 
gions; in that case, the angle /3 was chosen so that Eq. ( 
satisfied only when / > 100. The middle plot shows the variation 
of d' mm ,(J3) with m'. With /3 = 0.52331, / = 1000 and m = 0, 
the transition from non-classical to classical regimes occurs at 
m! = 500. The last plot shows the variation of d l ,(J3) with f3; 
the value of m' =71 was chosen so that the transition from 
non-classical to classical occurs at /3 = n/A. A noticeable fea- 
ture of all three plots is the tallest peak is always the first peak 
after the transition to the classical region. This is qualitatively 
understandable from Eq. |5]) where the reduced Wigner matrix is 
seen to characterize the overlap between two spin states after a 
rotation. In the classical limit of large /, the angle a> of the spin 
direction of a quantum object with the z axis is given by 



\lm) 



(16) 



One might expect that the overlap would be greatest when the 
'classical' spins are aligned after the rotation about the y-axis. 
From Eqs. (Ill and ( 16 1, we can show that this is the case when 



jS = acos 



V/(Z + 1) 



VZ(Z + 1) 



(17) 



In our example, m — and Eq. ( 17 1 reads sin(j8) = m! +1)] 
and the overlap is greatest at the transition point. 



Reduced Wigner matrix elements for variable / and m=10, m'=0, (3=0.0996687 
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Reduced Wigner matrix elements for variable m and 1=500, m'=0, (3=0.52302 




100 200 300 400 500 

m 

Reduced Wigner matrix elements for variable (5 and MOO, m=71, m'=0 




Fig. 1. The top plot shows the variation of 0^(0.0996687) 
for 10 < I < 500; the middle plot shows the variation of 
0^(0.52331) for < m! < 500; the third plot shows the varia- 



H3) tion of d^iP) for < /3 < n/2. 



2. Algorithm 

The evaluation of the d' mm ,i/3) from Eqs. |7j and ^ requires 
starting values for the recursions. For large /"however, those val- 
ues are often vanishingly small and cannot be represented by 
any of the IEEE 754 floating-point data formats which are used 
on practically all current computer hardware. Starting the recur- 
sions with and 1 does not help because there will come a point 
where the d\ nm , (J3) become too big to be represented numerically. 
Two solutions to this problem are presented here. 



is 2.1. d l mm ,Q3) ratios 

From Eq. (rob and the plots of Fig. II] it is clear that the d l mm ,ifi) 
vary smoothly with varying /, m, and j3. As a result, a recursion 
relation of ratios should always be finite in the non-classical re- 
gion where the d mm ,ifi) are not oscillatory, and one only has to 
worry about singularities in the ratios in the classical/oscillatory 
region where the denominators could vanish if evaluated at a 
zero of the d l mm ,(J3). In this ratio-based method, Eqs. ([7J) and ([8j 
can be rewritten 



mm! 

d l , ' 

mm - 1 



(18) 



V(Z + m')il - m' + 1) 



i cosec/3 - m'cotan/? - V(^ ~ m')(l + m! + 1) ™ f '"' +1 



-//+! 



(19) 



(l+l)yJ(P-m' 2 )(l 2 -m 2 ) 



(2Z+l)(mm' -cos/3) - I V[(/+l) 2 -m' 2 ][(/+l) 2 -m 2 ]/- 
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Using 

dl 



d> 



ml-l 

d' 

ml 
ml 



VITSsinjS 
I cos /3 - m 



and 



(I + m+ 1)(Z - m + 1) 



2/+ 1 



[(/ + 1) cos/3 - m] 



(20) 
(21) 



the ratios can be calculated recursively down to m' — if using 
Eq. ( 18 i or up to a / = / max if using Eq. ( 19 1. For example, in the 



case where all the dl , for I > m' > are required, one would 



start with Eq. ( p~8| > to calculate: 



4 



To then calculate the 



4 2 Q6) <Q8) 
408)' 408) 



one would need to know 



"oo- 



Fortunately, the 4 316 easv t0 calculate because their recursion 
relation do not contain exponential solutions as remarked under 
Eq. ([TT}: 



COSMOS) = 



21+r 00 



—d M 
2l+i a °° 



(23) 



Once d' lm (J3) has been calculated, dL can be calculated from the 
ratios; in order, each d' Qm (J3) can be calculated by multiplying 
adjacent ratios until d' Ql (J3) has been evaluated. In the case where 
d\ , for / > m' > are also needed, the set of ratios 

lm ' 



d[,(J3) 4 M 08) d[ 2 (fi) d' u (J3) 



d'(fi) d'{p) 



4 08)' d\ (J3) 



(24) 



are next computed. To then calculate the d[ one needs to know 



d' w . Fortunately, d' Q 
all the d l 



01 



-d l w has previously been calculated and 
, , can be obtained up to dl,. In this fashion, all the 

lm ^ 1/ 

d\ nm ,(§) can be calculated up to a desired m = m max . 

The special and extremely rare case where a ratio 
d l ,,,/d 1 , is infinite can be handled by using Eq. (17b where 

mm+l' mm' J on 

d l ,(B) is set to zero and substituting an infinite ratio and a null 

mm *~ / o 

ratio for a single finite ratio: 



' nun 



V(/ + m')(l ■ 



1) 



d l , d' , 

mm' mm'- 



V(/ - m')(l + m' - 1) 



(25) 



Note that in contrast to the method described in [ 1 1, the column 
of matrix elements d' ml (J3) — > d JJS) can be evaluated without 
having to calculate every single d ,(J3) for I' < I. The same 
tricks can be applied to the recursion relation in I for the evalua- 
tion of d\ ,(J3) — > d l p™(£>). First calculate the column of elements 



"max 



(J3) -> d'^ifi) and then calculate the ratios 



(26) 



//;?' 



'03) C(« 

Knowing d^(fi) allows us to evaluate t/ ; '™" _1 (j8) — > d' lm ,(J3). 

For a particular /3, it is sufficient to compute the elements of 
of' ,08) for < m < I and -I < m' < I to know the entire matrix 

mm v / 

ii'08). To fill out the rest of the matrix, the symmetry relations in 
appendix [A| can be used. 



2.2. Wigner matrix elements by l-recursion 

Another way to deal with the underflow problem is to start from 
Eq. <|8j with the following initialization values 



dim 


= A 


(cos- 


dU m (® 


= A 


( I 3 

(cos- 


d\jP) 


= A 


( p 

cos - 
\ 2 


d\jf» 


= A 


( P 

cos - 

\ 9 



l+m 



) (--ir 

£>i-m . ^>/+m 

2 ( Sm 2) 



l-m 



sin 



(27) 



(28) 



(29) 



(30) 



^22) where A = V(2/)!/(/ + m)!(/ - m)! . As far as equations (27i 



to (30 1 are concerned, the underflow problem can be avoided 
by simply calculating the logarithm of the absolute value of the 
matrix element and storing its sign separately. Equation (27 1, 
e.g., then transforms to 

In 14^03)1 = 0.5(ln((2/)!)-ln((/ + m)!)-ln((Z-m)!) 

+ (/ + m)ln|cos03/2)| +(Z-m)ln|sin08/2)| (31) 

In cases where one of the last two terms is -oo, the recursion 
in / can be stopped immediately, since all subsequent values will 
be zero. 

The logarithms of the faculties are easily precomputed, so 
that the seed value for the recursion can be obtained in O(l) 
operations. 

is in some circumstances much 



Since the result of eq. (31 



smaller than the individual terms on the right-hand side, cancel- 
lation errors may reduce the number of significant digits of the 
result. In order to have the highest accuracy that can be achieved 
without sacrificing too much performance, the computation of 
the seed value is carried out with extended IEEE precision (cor- 
responding to the C++ data type long double). 

The recursion relation (|8]l itself unfortunately cannot be 
computed conveniently in logarithms; therefore a way must be 
found to represent floating point values with an extreme dynamic 
range, which does not incur a high performance penalty. 

This was implemented by representing a floating-point num- 
ber v using an IEEE double precision value d and an integer scale 
«, such that 

v = d-S n , (32) 

where either d = 0, or S _1 < \d\ < S and S (the "scale factor") is 
a positive constant that can be represented as a double-precision 
IEEE value. Using this prescription, v does not have a unique 
representation as a [d, n)-pair, but this is not a problem. 

Similar techniques have been in use since at least three 
decades in numerical algorithms; for a recent example see the 
spherical harmonic transform routines of the HEALPix package. 

It is advantageous to choose a scale factor which is an inte- 
ger power of 2, because multiplying or dividing by such a factor 
only affects the exponent of a floating-point value stored in bi- 
nary format, and is therefore exact (ignoring possible under- or 
overflows). In order to avoid frequent re-scaling of d, the scale 
factor should also be rather large; the value adopted for our im- 
plementation is 2 90 . 

Using this representation for the d' mm ,(J3), the recursion is per- 
formed until either Z max is reached, or the matrix element has be- 
come large enough to be safely represented by a normal double- 
precision variable (the threshold value used in the code is 2~ 900 ). 
In the latter case the remaining computations up to / max are done 
with standard floating-point arithmetic, which is significantly 
faster. 
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3. Convolution 

One area where fast and efficient techniques of computing 
d l mm ,(J3) are particularly valuable is in An convolution (6j. For 
the convolution of two fields b(Q) and s(Q) defined on a sphere, 
the following integral must be calculated: 



(33) 



In the physical application where b(Q) is a beam from a horn 
located on a slowly rotating space telescope that scans the sky 
(denoted s(Q)) as it orbits the sun (WMAP or Planck missions, 
e.g.), a large number of convolutions must be performed to ac- 
count for every possible orientation {a,/3, y) of the satellite 



c(Q') 



d2[R(Q>(Q)]*.s(fl) 



(34) 



where R(Q') is a rotation matrix that rotates the beam to a par- 
ticular orientation of the satellite and is defined 



R(a,ft,y)Y lm (6,<f>) = £ D l m , m (a,ft,y)Y lm ,(6,<P) . 



(35) 



and where the Yi m (Q,<f>) are spherical harmonics. The Y[ m {8,(p) 
are related to the D l ,(a,j3, y) through the relation 



Transform algorithm to perform the summations. The numeri- 
cally expensive part of Eq. ( 37 1 is the computation of Eq. ( [38} 
which scales as (9(/ max /«/,max)- Two separate computations of 
Eq. ( 38 i scale as 0(/ max m/, max ): the computation of the d' mbm ^(ft), 
and the evaluation of the sum over < / < l max for every single 
m, OTsky, and ft. The fast methods described in the previous sec- 
tion are used to compute the d l mb „,^(ft). To evaluate the sum for 
each ft, a massively parallel MPI-based approach is used since 
the C mb „ Uty (ft) are uncorrected between the different ft and can be 
computed by different tasks. Additional acceleration techniques 
for both the computation of the d m>m (fi) and for the evaluation 
of the sums over I are described in the following sub-section. 



3.2. Acceleration techniques for the d l mt>msky (J3) 

In simulations of the measurement of cosmic microwave back- 
ground, convolutions appear repeatedly especially if Monte 
Carlos are required. Since the generation of Wigner matrix el- 
ements is typically the most computationally intensive part of 
the convolution algorithm, a large effort was made to increase 
its efficiency. This has two aspects: first to compute the matrix 
elements as quickly as possible, but also to decide (if possible) 
which matrix elements are too small to contribute measurably to 
the result and skip their generation altogether. 



Yue, 0) = (-!)" 



21+ 1 
An 



(36) 



and can therefore be calculated using the methods described 
above. 

For beams with significant side-lobes stemming from reflec- 
tions of light far away from the line of sight as is the case for 
both the WMAP and Planck missions, the beams can cover a 
significant portion of the sky and full-sky convolutions are nec- 
essary; as shown in ref ||6), such full-sky convolutions are much 
faster when performed in harmonic space instead of pixel space. 
We now describe a very fast and massively parallel method to 
perform full-sky convolutions in harmonic space. 

3.1. CONVIQT 

conviqt (CONvolution Via the Quantum Top equation) is a fast 
An convolution algorithm that relies on fast computational meth- 
ods for reduced Wigner matrix elements. Starting from Eqs. |34} 
and ( 35 1, the beam and sky fields can be expanded on the spher- 
ical harmonic basis to yield 

"Itaiax 'max 

c(a,ft,y) = 2 Z e ~ imbae ~ inhtyrc <«^(P) ■ (37) 



C,n b m^ft) = Y,b\ m / mhm Jfi)s 



(38) 



where b\ nib and s/ msk are the spherical components of the beam 
and sky fields^] Typically, the b[ mb are negligible for some 
mbmax < <s / m ax- Noting that the number of ft angles needed 
for the convolution scales as Z max , the evaluation of Eq. (37 1 
; log(/, 



scales as 0(/ ma x m taax , 



K )) after the use of the Fast Fourier 



2 For ease of reading, only the scalar case is described since the gen- 
eralization to polarized maps and beams is easily accomplished by eval- 



uating Eq. d3 8b for the additional pairs (bf 



■sky 



) and (b' 



3.2.1. Skipping unneeded calculations 

When performing convolutions, especially within the context 
of Monte Carlo simulations where many convolutions with the 
same Z max and mt, max are needed, it is computationally prof- 
itable to skip unneeded calculations]^] Some terms in the sum 



of Eq. (|38|l need not be included because their d l mm (ft) are van- 



ishingly small. To determine which terms to exclude we turn to 
Eq. ( [TT] ) where three general possibilities are considered: 

- rab and m s k y are of similar magnitude and much smaller than 
I. 

- »?b and m s k y are of similar magnitude and of the same order 
of magnitude as I. 

- m\, and m s k y are of widely differing magnitude with one much 
smaller than I and one of similar magnitude. 

In each of these possibilities, the neglected £/'„ b „, sk 08) are those 
evaluated at ft angles that correspond to the non-classical, ex- 
ponentially suppressed region. Before explaining how these 
d l mbin ^{ft) are identified, conviqt's nested structure should be 
described. In conviqt, the outermost loop deals with (which 
ranges from to mi, max \ nested into the mb-loop is the m s k y - 
loop ranging from -m s k y to m s k y ; nested in the msky-loop is the 
loop over the ft processed by that particular taskn Finally, the 
innermost loop is that over / which is where the condition is ap- 
plied. To derive the minimum / such that outside the parameter 
space defined by Z min < / < l mdx , d' mbm ^(ft) is negligible, we use 



3 Note that it is generally not more efficient to evaluate all of the 
^m b m sky (fi) before hand because of the disk space required and the large 
amount of time needed to read them in; it is more efficient to calculate 
them on the fly. 

4 A note to remind the reader that conviqt is parallelized in ft; each 
task will perform the convolutions in a subset of all the complete set of 
f} where C,„ b ,„ sky (ft) must be calculated. 
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Eq. ( fTT) to write 



1 1 
-2 + 2 



1 + 



sin 2 /? 



( m sky + m l ~ 2m sky'Wb COS/?) 



1/2 



4 



m sky + m b 2m s i cy mb cos /? - offset 



sin/3 



(39) 



where offset>0 and ensures that the d l m ^ m ^ (J3) neglected are 
well within the non-classical region and suppressed. Calling 
the larger of \m\,\ or \m^ y \, it is noted that I > m\, ls . To determine 
the offset, we go back to the three possibilities listed above; of 
those, / m ; n will generally equal ;«t,i g m the cases where and 
m s k y are of similar magnitude; only in the third case will we gen- 
erally have Z m ; n > ;«bi g , namely when m\, and m s k y are of widely 
differing magnitude. From ref. |2|, this case can approximately 
be written as a harmonic oscillator wave function: 



-V2, 



(-l)'-^(V/sin^ m ) 
K v (jt) = ( ^2 v v[)- ll2 H v (x)e- xl/2 



(V/08-j8 m ) (40) 
(41) 



where // v (x) is a Hermite polynomial and cos/?,,, = mb//. Since 
we are dealing with orders of magnitude, it is not necessary to 
evaluate Eq. (40 1 exactly to determine offset, only to calculate 
an estimate from the factor exp(-x 2 /2). We have found that us- 
ing offset> /max/20 gives extremely accurate results. Finally 
it is noted that d' nib (0) = S mb „ hky and for that special case we 
put Z m i n > / max when + m S ky and no sum over / is performed. 
Thus, Eq. ([39| is used to estimate whether the absolute values of 



all d l mbnk (jS) for a given combination of /, m\,, m s ^ y and /? lie be- 
low a certain threshold; if this is the case, the generation of these 
values can be skipped entirely. In particular, the most efficient 
version of the code written was one where the d^ m ^ and d£^. 
were pre-calculated for < m\, < m.b m!a , -/ m ax ^ '«sky ^ 'max 
and /? subset for a particular task, and read-in as seeds to the 
recursion relation of Eq. (|8j. That way, none of the unneeded 
d l m b m sky (fi) were calculated during the convolution. This required 
an extra code to pre-compute the d mm ^ (/?). In the end, we opted 
for a single code based on the evaluation of the reduced Wigner 
matrix elements as described in section F2~2l because of the low 
overhead and maintenance as well as the high efficiency. 

In this approach, we calculate all the ^ (/?) on the fly, but 

only include the relevant d l mb „ hky (J3) in the final Hoop. This code 
is self-contained and easier to maintain at a very minimal cost in 
performance. As the d l mtm ^ (/?) recursion is performed, the code 

checks the absolute values of the generated d l ,(J5) and records 
the I index at which a predefined threshold s (typically set to 
10~ 30 ) is crossed for the first time. Due to the limited dynamic 
range of IEEE data types, values below this threshold have no 
measurable influence on the convolution result and can there- 
fore be neglected during the final summation loop, which saves 
a significant amount of CPU time. 



3.2.2. Precomputed values 

In this single code approach, a precomputation strategy well- 
suited to the loop structure was adopted: 

- At the beginning of each run, we compute just once 
ln(cos(/?/2)), ln(sin(/?/2)) and cos/? for all /? at which we 
need the Wigner matrix, and as mentioned above, Inn! up 
to n = 2Z max . 



- Also at the beginning we compute the tables 

Pi = and Qi = + 1) 

for i in the range of to 2/ ma x + 1> which are needed for the 
next precomputation step. 

- inside the second loop (i.e. for every combination of and 
m sky) we compute the tables 

Foj = (I + 1)(2Z + 1)JWj 
Fij = m b m sky /(/(/ + 1)), and 

Pl,l = Ql+m b Ql-m b Ql+m sky Ql-m sk y(l + I)/'- 

After all these preparations, eq. ([8]) boils down to 

C^) = Foj (cos/? - Fij) d l mM (J3) - F 2J d' m ^(J3), (42) 



m b m sky v 



which corresponds to only five quick-to-compute floating-point 
operations. 

The space overhead for the additional tables is 0(l m!lx ), 
which is insignificant compared to the <9(/ max ) memory require- 
ment of the whole convolution code. This also means that for the 
reasonable assumption of Z m ax ^ 10 4 all data required for the re- 
cursion fit conveniently into current processors' Level-2 caches. 



3.2.3. Use of d' mm ,(J3) symmetries 

The use of the d l mm , (ft) symmetries considerably cut the compu- 
tational cost of the full sky convolution. In particular, Eq. (A.6i 
relates the computed values of C„, b ,„ sk (/?) at /? < 7r/2 to those at 



/? > 7r/2. In Eq. (A.6i, the phase factor (-1)' is accounted for 



by splitting the sum in Eq. ( |38| > into even and odd /. The phase 
factor (-l)~ m ' is accounted for by using the relations 



b lm = (-\) m bt m , s lm = (-l) m s 



l-n 



(43) 



and further splitting the odd and even sums of Eq. ( |38| l into real 
and imaginary parts. In addition, the symmetry of Eq. ( A.l i and 
Eq. ( 43 1 can be used to show 



C-m b -m A y(P) — Cm b ,„ st (/?) 



(44) 



speeding up the computation of C mbnisky (J3) by another factor of 
two. 

3.3. Example simulations 

To determine the accuracy of conviqt, a detailed compari- 
son with the stable release of the Levels totalconvolver 
J6][2l currently compiled on the planck cluster at the National 
Energy Research Science Council (NERSC) was performed. 
Levels is a simulation package for the generation of time or- 
dered data (TOD) by the Planck satellite IS). Totalconvolver 
and conviqt both calculate a data cube that is fully compati- 
ble with LevelS. For both codes, data cube is composed of con- 
volved points calculated at a polar angle 9, a longitudinal angle 
<p and a particular beam orientation (a rotation about the beam 
axis) iff. 



3.3.1 . data cube comparison 

For /max = 2000, GRASP beams for LFI-19a, OT femax = 9, offset 
= 30 and a polarized CMB map, there were 153634399 points in 
the data cube. Taking the difference between the totalconvolver 
and conviqt data cubes (residual values below), we had: 



conviqt and totalconvolver timings 



8 10 



10' 




the single code approach to obtain the following table: 



Memory consumption of conviqt and totalconvolver 



~ lo- 
rn 
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UP 



O.N 



1 0-4 



J 1 


1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
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1 1 1 1 1 1 ' 1 1 
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conviqt 
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,1,1,1,1,1,1,1,1, 
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Fig. 2. The top plot compares the wall clock performance of 
conviqt and totalconvolver at / max intervals of 256 starting at 
^max = 256; the middle plot compares the memory needs in 
GBytes of the two codes. The lower plot shows the ratio of wall 
clock and Gbytes of the two codes. 



offset 


avr 


cr 


Max 


rel 


2000 (exact) 


-1.7e-16 


4.1e-13 


1.4e-ll 


4.0e-8 


30 


-1.2e-16 


1.3e-ll 


5.6e-10 


1.3e-6 


15 


-1.6e-16 


3.0e-9 


l.le-7 


2.9e-4 



where 'avr' refers to the average difference of the two data cubes 
(conviqt(fl, cf>, i/O-totalconvolver(0, <f>, tf/)), cr is the variance 
of that residual data cube, 'Max' refers to the maximum value 
found in the residual data cube, and 'rel' refers to the ratio of cr 
to the variance of the totalconvolver data cube. We see that 
with off set =2000, the two data cubes agree to approximately 
8 significant digits; for offset=30,15 they agree to 6 and 4 
significant digits respectively. 

3.4. Performance 

On a single processor on Jacquard, for lmax=2000 with a 
GRASP beam (LFI-19a) of m bmax = 9, offset=30, MC=T in- 
cluding polarisation, conviqt had the following performance 



Code 


Clock (sees) 


gbytes 


conviqt 


349 


0.37 


totalconvolver 


1120 


0.41 



where 'Clock' refers to the time the it took to complete the con- 
volution according to the wall clock, while 'gbytes' refers to the 
total memory consumed. These numbers were obtained using 
NERSC's Integrated Performance Monitoring (IPM) on a 712- 
CPU Opteron cluster called Jacquard running a Linux operat- 
ing system; each processor on Jacquard runs at a clock speed 
of 2.2GHz with a theoretical peak performance of 4.4GFlop/s. 
Unlike totalconvolver, conviqt is a massively parallel code 
which can be run on machines with distributed memory; running 
it on a single processor shows that conviqt is intrinsically faster 
and more efficient than totalconvolver. The above table is for 
the case where offset < Z max , i.e., the case where conviqt sac- 
rifices precision for the sake of a speedier convolution. 

For the general case where conviqt and 
totalconvolver calculate the same thing, we use 





conviqt 


totalconvolver 


^max 


sees 


GBytes 


sees 


GBytes 


256 


1.9657 


1.51806e-02 


11.726 


2.32534e-02 


512 


9.8691 


1.76401e-02 


54.634 


5.00412e-02 


768 


27.982 


2.16856e-02 


144.51 


9.46579e-02 


1024 


61.797 


2.73190e-02 


294.71 


1.57092e-01 


1280 


117.13 


3.45383e-02 


513.29 


2.37350e-01 


1536 


199.29 


4.33445e-02 


809.51 


3.35554e-01 


1792 


313.27 


5.37376e-02 


1264.1 


4.51582e-01 


2048 


621.26 


6.57196e-02 


1765.2 


5.85376e-01 



conviqt is considerably faster than totalconvolver for 



L 



< 2048; however, because both codes scale as I as 



'max — > °°, the gap between their total wall clock times will 
narrow. It is also seen that conviqt consumes significantly less 
memory. 

The scaling of conviqt timings as a function of the to- 
tal number of processors is very good. For a Z max = 4096 and 
;«b = 14 with polarized beam and sky, the log-log plot in Fig. [3] 
shows a linear relationship up to a convolution distributed on 
128 processors. 
Number of processors seconds 



965.46 



16 

32 



486.29 
247.83 



64 



128.43 



128 



69.06 



192 



52.47 



This plot was obtained using the single code approach run on the 
NERSC cluster called Planck, a 256 cores cluster of Opteron 
2350 2.0GHz processors. To measure the scaling behavior of 
conviqt, no output file was created to avoid skewing the scaling 
law with the time it takes to write the file (tens of seconds for a 
4GB file). As the number of processors increases and the time 
required to perform the convolution diminishes to less than a 
minute, the timings become dominated with operations that have 
nothing to do with the convolution; among these are the reading 
of the input data sets (which are read in full by all MPI tasks), 
the inter-process communication and various calculations which 
are performed redundantly on all tasks, because communicat- 
ing the results would be more expensive. Increasing the number 
of tasks (while keeping the problem size constant) also means a 
smaller number of j3 angles per task, which decreases the achiev- 
able quality of load balancing. In addition, different runs with 
identical inputs show variations of a few seconds in wall clock 
timings that have an increasing relative impact on the decreas- 
ing timings stemming from using larger numbers of processors; 
the most likely explanation for this are differences in the exact 
nature of process startup and disk access, which is not exactly 
reproducible in this kind of computing environment. 



4. Summary 

New algorithms for the efficient and accurate calculation of 
Wigner matrix elements were presented. These algorithms were 
used in a full sky convolution, massively parallel algorithm 
called conviqt that was shown to be significantly more 
efficient and much faster than the only other algorithm currently 
available. 
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Appendix A: 





= (-ir- m 'd i _ m _ m ,(j3) 


(A.l) 


d\ m M 


= (-iT- m 'd l m , m (jl) 


(A.2) 


^mm' \P) 


— d_ m ,_ m (J3) 


(A.3) 


dLA-P) 




(A.4) 




= (-ir- m 'd l mm ,Q3) 


(A.5) 




= (-V> l - m 'dl mm W 


(A.6) 




= (-v l+m 'dL m m 


(A.7) 
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